Abstract

Background: Microarray technology has revolutionized the way genomic analysis is performed. High-throughput
data acquisition, however, has created a challenge in data comprehension, for example in gene expression studies.

Methods: k-means clusters obtained from the analysis of miRNA expression data were sorted by an algorithmic procedure.
Results: The proposed method sorted the k-means centroids and offered a simpler way of drawing conclusions
about the studied tumor samples. miRNAs were identified whose expression levels changed with respect to tumor aggressiveness.
Conclusions: In the present work we presented a new and simple approach to data analysis,
which we termed sorted k-means analysis.
I. Introduction

Microarray technology is based on a basic property of nucleic acids: the selective binding of two complementary
chains/sequences. The basic technological idea of exploiting this property already existed [1, 2], but what gave this
technology a huge boost was the development of microarrays. However, this would not have been feasible without the
unraveling of the sequence of the human genome. At the same time, technological advances such as the
miniaturization of arrays and high-density printing on a solid substrate allowed the appearance of microarray chips
[3]. The advent of this technology gave rise to the initial idea that questions about cellular and molecular events could
easily be answered through a comparison between the “control” and the “investigated” samples. Microarray technology was
first applied to molecular investigations at the genome level at the end of the 1990s [4, 5]. DNA microarrays detect patterns
of gene expression; they can therefore be used to acquire such “images” and to draw conclusions on cell state [6].
cDNA microarrays have been used for a plethora of experiments; virtually any property of a DNA sequence that can be
experimentally modified may be assessed for differential expression, and this can be performed on
thousands of sequences simultaneously. Research questions that can be answered with DNA microarrays relate mainly
to the investigation of gene expression. Microarrays can compare the relative abundance of the mRNA (or microRNA; miRNA)
of a gene under investigation between different cells or tissue samples. For example, an experiment could compare cells
before and after an experimental intervention, at successive moments of a specific process, between stages of
differentiation, or between a mutant cell and the wild type. This would be the simplest type of
experiment. In particular, microarrays have been applied to the diagnosis of cancer [7, 8]. They have been used to
investigate the hypothesis that the classification of cancers can be based on their gene expression profiles, subsequently
eliminating the need for histopathological diagnosis [9]. Microarray analysis has been used to predict “tumor grades” or
subtypes of cancers, regardless of prior knowledge of their biology [10]. Microarrays also provide an opportunity to study
whether tumor gene expression correlates with the prediction of disease outcome. In general, microarray
platforms afford attractive methodologies for discovery-based investigations [11].

In the current study, we present a novel methodology for the analysis of gene expression data using a unique k-means
approach, which we define as sorted k-means. We demonstrate the suggested algorithm on a previously studied
dataset from childhood Central Nervous System (CNS) tumors.
II. Materials and Methods

2.1 Samples 

Previously studied childhood CNS malignancies were investigated for their miRNA expression profiles [12]. The
aforementioned study included a total of 26 resected brain tumors from children diagnosed with pilocytic astrocytomas (PA)
(n=19) and ependymomas (EP) (n=7). Additionally, we included glioblastomas (GB) (n=12) (online data: E-MEXP-1029¹ [13]),
germinomas (GE) (n=12) (GSE19347²) and dysembryoplastic neuroepithelial tumors (DNET) (n=4). All were
diagnosed according to the 2007 WHO criteria [14]. Seventeen samples were used as controls: the First-Choice Human Brain
Reference RNA (Ambion, Austin, TX, USA) and 16 samples obtained from deceased children who
underwent autopsy and did not present with any brain abnormality, from the following anatomic locations: cerebellum
(n=4), medulla oblongata (n=4), parietal lobe (n=4) and temporal lobe (n=4).

2.2 MicroRNA Profiling 

The miRNA profiling was performed as described by Braoudaki et al., 2014 [11, 15]. In brief, total RNA and miRNAs were
extracted using the standard TRIzol protocol (Invitrogen, Carlsbad, CA) and the mirVana miRNA isolation kit (Ambion, Austin,
TX). Labelling and hybridization were carried out using the Label IT miRNA labelling kit (Mirus Bio LLC, USA) following the
manufacturer’s instructions. All specimens were hybridized to an Applied Microarrays (miRlink Bioarray 300054-3PK) platform
and all images were scanned using the Agilent Microarray Scanner (G2565CA) controlled by Agilent Scan Control 7.0 software.
The total gene signals were extracted using the ImaGene 6.0 software (BioDiscovery Inc., USA). MicroRNAs were considered
significantly differentially expressed (DEx) when they obtained a p-value < 0.05 and a false discovery rate (FDR) < 0.05.
Overall, our analysis revealed 70 DEx miRNAs.
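
The text does not name the FDR procedure that was used. As an illustration only, the sketch below assumes the
Benjamini-Hochberg adjustment and applies the two thresholds stated above (p-value < 0.05 and FDR < 0.05). The function
names and the example p-values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_dex(p_values, alpha=0.05):
    """Indices passing both the raw p-value and the assumed FDR threshold."""
    q_values = benjamini_hochberg(p_values)
    return [i for i, (p, q) in enumerate(zip(p_values, q_values))
            if p < alpha and q < alpha]


# Hypothetical p-values for five miRNAs.
example_p = [0.001, 0.008, 0.039, 0.041, 0.6]
print(select_dex(example_p))
```

With a different FDR procedure the selected set could differ, so this should be read as one possible reading of the
stated criteria.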

2.3 Data Analysis 

The multiparameter analyses were performed with the MATLAB® simulation environment (The Mathworks, Inc., Natick, MA).
Microarray data were processed as previously reported [15]. In brief, filtering was performed based on the signal intensity.
Background correction was carried out by subtracting the median local background from the signal intensity as previously
reported [16]. Normalization was performed using the quantile normalization algorithm. The two-tailed Student's t-test was
used to test the mean differences between two groups. MicroRNAs were considered to be significantly differentially
expressed (DEx) if they obtained a p-value < 0.05 and an FDR < 0.05. MiRNA expression levels were further analyzed with the
k-means methodology. k-means is a method of cluster analysis which partitions $n$ observations into $k$ clusters, in which
each observation belongs to the cluster with the nearest mean. Given a set of observations $(x_1, x_2, \ldots, x_n)$, where
each observation is a $d$-dimensional real vector, k-means clustering aims to partition the $n$ observations into $k$ sets
($k < n$), $S = \{S_1, S_2, \ldots, S_k\}$, so as to minimize the within-cluster sum of squares. Subsequently, the cluster
centroids were calculated and sorted in ascending order. Sorting was performed in the MATLAB computing environment. Each
cluster was transformed into a DataMatrix structure, where rows indicated miRNAs and columns designated the tumor samples.
The DataMatrix was then sorted with respect to the columns and plotted. This was repeated for each k-means cluster
separately. The k-means implementation in MATLAB has a randomized component, namely the selection of the initial centers.
This implies that repeated runs of the methodology may yield different results. Our methodology, however, sorts the produced
centroids on every run, thereby accounting for the random effect of the MATLAB k-means algorithm. MiRNA annotation was
performed with the WebGestalt³ online tool [17, 18].
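
The sorting step can be illustrated with a short example. The following is a minimal pure-Python sketch, not the MATLAB
code used in this work (which is distributed as supplementary data). It assumes that sorting a cluster means reordering
the sample columns of that cluster so that its centroid values ascend. The function names, the toy matrix and the sample
labels are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means on a list of row vectors; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Randomized component: initial centers are k randomly chosen rows.
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(n_iter):
        # Assignment step: each row goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(r, centroids[c]))
                      for r in rows]
        # Update step: each centroid becomes the mean of its member rows.
        for c in range(k):
            members = [r for r, lab in zip(rows, new_labels) if lab == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


def sort_cluster_by_centroid(rows, labels, centroids, cluster, sample_names):
    """Reorder the sample columns of one cluster by ascending centroid value."""
    centroid = centroids[cluster]
    order = sorted(range(len(sample_names)), key=lambda j: centroid[j])
    members = [r for r, lab in zip(rows, labels) if lab == cluster]
    return {
        "samples": [sample_names[j] for j in order],
        "centroid": [centroid[j] for j in order],
        "rows": [[r[j] for j in order] for r in members],
    }


# Toy matrix: rows are miRNAs, columns are samples S1..S3.
data = [[5.0, 1.0, 3.0], [5.2, 0.8, 3.1], [0.0, 9.0, 4.0], [0.2, 9.1, 4.2]]
names = ["S1", "S2", "S3"]
labels, centroids = kmeans(data, k=2, seed=0)
for cluster in sorted(set(labels)):
    result = sort_cluster_by_centroid(data, labels, centroids, cluster, names)
    print(cluster, result["samples"], result["centroid"])
```

In this toy case the cluster numbering may change with the seed, but the sorted sample order of each group of miRNAs stays
the same, which is the sense in which sorting compensates for the random initialization.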


III. Results 

MiRNA expression profiles were clustered with k-means with respect to all CNS tumor samples, in random order (Fig. 1A).
The centroids were likewise calculated in random order (Fig. 1B). The calculated centroids were then sorted so that samples
appeared in order of ascending miRNA expression levels (Fig. 1C). The sorted samples
manifested a pattern across all samples. This type of sorting gave the opportunity to examine patterns of expression
with respect to the complete sampling. The next step in the evaluation of k-means clustering was to calculate the mean of
each tumor type with respect to each miRNA. Those miRNAs and tumor types were clustered in random order, where it
appeared that it was not easy to extract rapid conclusions (Fig. 2A, 2B). Sorting the centroids of the k-means clusters made
it easier to identify patterns of expression. In particular, centroids sorted in ascending order revealed specific
patterns: red-shaded clusters indicated clusters where the sample order coincided with tumor aggressiveness (Fig.
2C). In sorted clusters 13, 20, 22, 29, 31 and 34, tumors were presented from the most aggressive tumor (GE) to the most
benign (DE) (Fig. 2C). In addition, in cluster 33, miRNAs were classified with respect to aggressiveness from the benign
(DE) to the most aggressive tumor type (GE) (Fig. 2C). This type of analysis also revealed significant differences between
tumor types. For example, in cluster 13, the miRNA expression levels differed significantly between germinoma (GE) and DNET
(DE) samples. The suggested algorithm could successfully sort tumor types with respect to aggressiveness, both in
descending as well as ascending order. In particular, expression levels appeared to increase from aggressive to benign
neoplasms. Additionally, annotation analysis showed that miRNAs expressed in such a pattern participated in both
hematological malignancies and neuroectodermal tumors (Supplementary Table 1). The individual k-means clusters are provided
as supplementary data (file: kmeans_Group_Quantile.xlsx). In addition, the code implementing the suggested algorithm is
provided as supplementary data (file: MATLAB Code for Sorted k-means.docx).

1 http://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-1029/

2 http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19347

3 http://bioinfo.vanderbilt.edu/webgestalt/
Figure 1. k-means clustering of the complete sample set. All individual samples were clustered (A). Centroids were
calculated in the same order as the samples (B) and, further on, centroids were sorted in ascending order (C).
Fig. 2. The mean expression values of each miRNA were calculated with respect to the tumor type,
i.e. Dysembryoplastic Neuroepithelial Tumors (DNET, designated as DE), Glioblastoma (designated as
GB), Pilocytic Astrocytoma (designated as PA), Ependymomas (designated as EP) and Germinomas
(designated as GE). Mean miRNA expression levels were clustered in random order (A), and the
centroids were also in random order (B). When the centroids were sorted in
ascending order, specific patterns were revealed. In particular, red-shaded clusters indicate
clusters where the sample order coincides with tumor aggressiveness (C). In sorted clusters 13, 20, 22, 29, 31
and 34, tumors were presented from the most aggressive tumor (GE) to the most benign (DE) (C). In
addition, in cluster 33, miRNAs were classified with respect to aggressiveness from the benign (DE)
to the most aggressive tumor type (GE) (C).
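
The per-tumor-type averaging and the reading of a sorted centroid can be sketched as follows. This is a hypothetical
pure-Python illustration and not the supplementary MATLAB code. It checks only the two endpoints named in the text, GE
as the most aggressive type and DE as the most benign; the function names and the example values are assumptions.

```python
from statistics import mean


def mean_by_tumor_type(expression, sample_types):
    """Average one expression vector over the samples of each tumor type."""
    groups = {}
    for value, tumor in zip(expression, sample_types):
        groups.setdefault(tumor, []).append(value)
    return {tumor: mean(values) for tumor, values in groups.items()}


def endpoint_pattern(type_means, benign="DE", aggressive="GE"):
    """Sort tumor types by ascending mean and label the endpoint pattern."""
    ordered = sorted(type_means, key=type_means.get)
    if ordered[0] == aggressive and ordered[-1] == benign:
        label = "increases toward benign"
    elif ordered[0] == benign and ordered[-1] == aggressive:
        label = "increases toward aggressive"
    else:
        label = "no endpoint pattern"
    return ordered, label


# Hypothetical centroid values for six samples and their tumor types.
values = [1.0, 1.2, 2.0, 3.0, 3.2, 5.0]
types = ["GE", "GE", "GB", "PA", "PA", "DE"]
ordered, label = endpoint_pattern(mean_by_tumor_type(values, types))
print(ordered, label)
```

Only the endpoints are tested because the text ranks GE and DE explicitly; a full ranking of the intermediate tumor types
would have to be supplied by the user.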

IV. Discussion 

K-means methodology is an extremely useful tool in the analysis of high-throughput gene expression data. Although useful,
the output of k-means clustering makes it difficult to extract conclusions, especially when samples are in random order.
The proposed method transforms the k-means output in such a way that cluster centroids are presented sorted
with respect to gene expression levels. To the best of our knowledge, this is the first report in which such an algorithm is
proposed. With this sort of representation we were able to distinguish tumor types with respect to aggressiveness. Also,
several miRNAs manifested increasing expression levels as tumor grade decreased (from very aggressive to benign), which
hinted towards groups of miRNAs that may serve as possible tumor suppressor markers. The opposite pattern was also
observed, i.e. miRNA levels increasing from benign neoplasms to more aggressive ones. This observation led toward a group
of putative oncogenic miRNAs. For example, the data used included tumors ranging from benign (e.g. DNET) to very
aggressive (e.g. GB and GE), where both patterns were detected and miRNAs were identified as possible markers involved in
tumor progression or tumor inhibition. For example, as previously reported, miR-184 and miR-766 potentially serve as
oncogenic markers, which is consistent with our observations [19-21]. We also found that miR-1 was up-regulated as tumor
grade increased, whereas other reports refer to this miRNA as a putative tumor suppressor molecule [22-24] and a possible
novel therapeutic marker. Data sorting can be performed for several characteristics besides tumor grade, and it is possible
to discover patterns based on other clinical phenotypes. The proposed algorithm is very simple and easy to use, yet it
could be of important assistance in the comprehension of complicated datasets, including microarray expression data.

Readers can request the Supplementary data by emailing glamprou@med.uoa.gr or info.iioer@gmail.com.

International Journal of Engineering Research & Science (IJOER), ISSN: 2395-6992, Vol-2, Issue-8, August 2016

V. Conclusion 

In the present work we presented a new and simple approach to data analysis, which we
termed sorted k-means analysis. In several cases, k-means classification provides useful insight towards the understanding of
biological data. Our analysis expands this potential and provides a further classification step, which might assist in an
easier and more comprehensive understanding of complex microarray data.